Analyzing product reviews is essential, as it gives insight into how people feel, think, and react to a product. In this project, we analyze reviews of two main products, Apple and Samsung phones, each of which comes in several models. The main purpose of this project is to analyze and compare how consumers on Amazon's website interact with each model of these two phone brands and how they describe them.

We will study this interaction by analyzing how review texts differ between the two brands over time, and by identifying similarities and differences in the vocabulary used to describe the phones from all perspectives, such as functionality, memory, and condition. To obtain a better overview and more relevant results with respect to the technology used and the development of mobile phones, we focus only on the three newest generations of each brand: the Apple and Samsung phones introduced in 2019, 2020, and 2021. The specific models covered by this project are listed in the Data Retrieval section below.

We start by describing the data scraped from Amazon and the procedures used to treat, clean, and prepare it into a corpus, accompanied by some graphical analysis. We then perform a sentiment analysis to understand how consumers feel about and react to these brands in general and each model in particular. Next, we use unsupervised learning techniques, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), to identify clusters and similarities in the reviews and to analyze topic attribution based on the model being reviewed. Finally, we use supervised learning techniques, such as Random Forest, to predict review sentiment based on the brand and the embedded review text. All of the above is used to draw conclusions about consumers' reactions to each brand and its models.

Data Preparation

In this section, we explain the steps we took to retrieve the data from the US Amazon marketplace. In addition, we explain some of the tasks we had to apply to the review texts in order to obtain the final tables. It is worth mentioning that, due to the heavy computation in this section, we only present the final outcomes and before/after examples of the wrangling and cleaning. The code can be found in the scrapping.RMD file.

Data Retrieval

In this section, we scraped the smartphone reviews of Apple and Samsung phones from the US Amazon marketplace. We decided to retrieve the reviews of the eleventh, twelfth, and thirteenth generations of Apple phones, and of the S20, S21, and S22 generations of Samsung phones. The phones in each generation, together with the number of reviews, are listed below:

  • Apple - 12,965 obs
    • Generation 11th - 10,276 obs
      • iPhone 11 - 5,414 obs
      • iPhone 11 Pro - 2,844 obs
      • iPhone 11 Pro Max - 2,414 obs
    • Generation 12th - 2,239 obs
      • iPhone 12 - 975 obs
      • iPhone 12 Mini - 779 obs
      • iPhone 12 Pro - 376 obs
      • iPhone 12 Pro Max - 109 obs
    • Generation 13th - 450 obs
      • iPhone 13 - 152 obs
      • iPhone 13 Mini - 146 obs
      • iPhone 13 Pro - 68 obs
      • iPhone 13 Pro Max - 84 obs
  • Samsung - 2,606 obs
    • Generation S20 - 177 obs
      • Samsung Galaxy S20 FE - 63 obs
      • Samsung Galaxy S20 Plus - 85 obs
      • Samsung Galaxy Note S20 Ultra - 29 obs
    • Generation S21 - 2,166 obs
      • Samsung Galaxy S21 FE - 1,775 obs
      • Samsung Galaxy S21 Plus - 361 obs
      • Samsung Galaxy S21 Ultra - 30 obs
    • Generation S22 - 263 obs
      • Samsung Galaxy S22 - 47 obs
      • Samsung Galaxy S22 Plus - 62 obs
      • Samsung Galaxy S22 Ultra - 154 obs

From these numbers we can say that Apple mobile phones are more popular in the US than Samsung phones, as users appear more willing to review phones from this brand. Furthermore, Apple's 11th generation had the largest number of reviews, with the iPhone 11 at the top of the list.

Amazon US smartphone reviews - Apple and Samsung
Brand Model Reviews
12983 Samsung Samsung Galaxy S20 FE Good phone with terrible battery
12984 Samsung Samsung Galaxy S20 FE Phone overheat alot
12985 Samsung Samsung Galaxy S20 FE No issue with the phone at all, gave this as a gift to my boyfriend for Christmas. Works amazing! Thank you so much!
12986 Samsung Samsung Galaxy S20 FE It was not full unlocked as described.
12987 Samsung Samsung Galaxy S20 FE Excelente terminal, muy buena relación, calidad precio. Recomendado
12988 Samsung Samsung Galaxy S20 FE Mi hijo estaba muy contento. Súper recomendable
12989 Samsung Samsung Galaxy S20 FE Me gusto mucho el teléfono, pero lo compre supuestamente desbloqueado para todas las compañías y no fue así, no lo puedo usar con Verizon y sprint y nunca dijeron nada a respecto eso fue lo que me decepcionó.
12990 Samsung Samsung Galaxy S20 FE The product worked as described by seller
12991 Samsung Samsung Galaxy S20 FE rendimiento espectacular y fotos increíbles!!
12992 Samsung Samsung Galaxy S20 FE Good

Reviews Language Detection (Text Classification)

After scraping the reviews we noticed that some of them were written in languages other than English. For that reason, we decided to use Transformers from the Hugging Face 🤗 website through the pipeline() function to apply tasks such as text classification and text translation. First, we applied the classification task with the model eleldar/language-detection, a fine-tuned version of xlm-roberta-base on the Language Identification dataset. With this model we were able to detect the language of each review in our dataset.
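A rough sketch of this detection step, called from R and assuming the Python transformers library is available through reticulate (the object names are illustrative, not the exact project code):

# # Language detection via Hugging Face transformers (sketch; assumes reticulate
# # and a Python environment with the 'transformers' library installed)
library(reticulate)
transformers <- import("transformers")

# Load the fine-tuned language-identification model mentioned above
lang_detector <- transformers$pipeline("text-classification",
                                       model = "eleldar/language-detection")

# Example: returns a language label such as "en" or "es" with a confidence score
lang_detector("Excelente terminal, muy buena relación, calidad precio.")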

After applying the text-classification task to the dataset, we created a column called Language to record the detected language and to count the number of reviews in each language.

Apple reviews - Language Detection
Brand Model Reviews Language
1239 Apple iPhone 11 Pro Max This is the second phone from Amazon I’ve tried to get onto my T-Mobile/Sprint account. Amazon and Apple swear it’s fully unlocked but multiple people at T-Mobile say I can’t use this phone because it’s locked to Verizon. Just buy from your carrier lol en
1240 Apple iPhone 11 Pro Max Amazing! Yes It’s Legit &’ Works As Advertised 10/10 en
1241 Apple iPhone 11 Pro Max El celular tenias muchos rayones notorios en la pantalla, se q es usado pero no me pareció confiable después de ver esas imperfecciones tan notorias. es
1242 Apple iPhone 11 Pro Max El producto no es 100% desbloqueado es
1243 Apple iPhone 11 Pro Max This type is good en
1244 Apple iPhone 11 Pro Max But the celular that I bought is not original, they sent me another charger that is not the celular one. others besides they made me pay dearly and without guarantee … I did not understand that. Thank you en
Samsung reviews - Language Detection
Brand Model Reviews Language
51 Samsung Samsung Galaxy S20 FE Love love love this phone. Absolutely no problems with it. en
52 Samsung Samsung Galaxy S20 FE A date mon garçon ne l a pas tuer…et c’est un gros gamer… fr
53 Samsung Samsung Galaxy S20 FE i think i made a very wise decision when I bought this phone. with the features and specifications that it has, no wonder some say that it was the best phone in its range. if you are looking for a high spec but has a limited budget, this phone is the best choice! en
54 Samsung Samsung Galaxy S20 FE Battery is worse than battery test :)) en
55 Samsung Samsung Galaxy S20 FE This arrived in just a few days via FEDEX with no problems at all. Be aware that the charger included is the massive European plug from Samsung. You will need an adapter. As far as the phone, it is a great phone with a great camera. Well worth the budget-friendly price compared to the more expensive options available. en
56 Samsung Samsung Galaxy S20 FE 概ね満足です。 不備などはありませんでした。 ja
57 Samsung Samsung Galaxy S20 FE マイネオauシムで使用マイネオも5gが12月開始ですがとりあえず4gで使用 ja

Reviews Language Translation

For this part, we used the translation task with the model Helsinki-NLP/opus-mt-es-en, which translated the reviews written in Spanish into English. We only translated this language because it was the second most common language in our data; languages such as French, Japanese, and Hindi each had fewer than 10 observations.
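Continuing the reticulate sketch above, the translation step could be called in a similar way (again an illustrative sketch, not the exact project code):

# # Spanish-to-English translation with the model named above (sketch)
translator <- transformers$pipeline("translation",
                                    model = "Helsinki-NLP/opus-mt-es-en")

# Example: translate a Spanish review into English
translator("Buen producto me gusto mucho")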

For visualization purposes, we combined the input and output tables to show how the translation was performed. For the analysis itself, we kept only the English versions of the translated reviews.

Apple reviews - Language Translation
Brand Model Reviews Language
a.10 Apple iPhone 11 Pro Max El teléfono estaba desbloqueado, batería 90%. Muy bueno. es
b.10 Apple iPhone 11 Pro Max The phone was unlocked, battery 90%. en
a.11 Apple iPhone 11 Pro Max El celular funciona regularmente bien, el sensor de proximidad no estaba funcionando bien pero no era mayor problema…. La batería estaba consumida pero definitivamente creo es la mejor batería que hecho iPhone dura todoooo el día, no recomiendo celulares usados por ese precio deberían costar menos ya que siempre viene con problemas los cuales result caro solucionarlos al menos en Ecuador es
b.11 Apple iPhone 11 Pro Max The cell phone works regularly well, the proximity sensor wasn’t working well but it wasn’t a major problem…. The battery was consumed but I definitely think it’s the best battery that made iPhone lasts alloooo the day, I don’t recommend used phones for that price should cost less as it always comes with problems which result expensive to fix them at least in Ecuador en
a.12 Apple iPhone 11 Pro Max Todo el telefono esta exelente desbloqueado y trabaja bien es
b.12 Apple iPhone 11 Pro Max The whole phone is exelent unlocked and works well. en
Samsung reviews - Language Translation
Brand Model Reviews Language
a.1 Samsung Samsung Galaxy S20 FE Buen producto me gusto mucho es
b.1 Samsung Samsung Galaxy S20 FE Good product I liked very much en
a.2 Samsung Samsung Galaxy S20 FE Buen teléfono, aunque no es la versión 5g, y los marcos son más grandes de lo que esperaba es
b.2 Samsung Samsung Galaxy S20 FE Good phone, though it’s not version 5g, and the frames are bigger than I expected. en
a.3 Samsung Samsung Galaxy S20 FE Excelente terminal, muy buena relación, calidad precio. Recomendado es
b.3 Samsung Samsung Galaxy S20 FE Excellent terminal, very good ratio, quality price. Recommended en


After the steps explained above, our files went from containing reviews in several languages to a final file with reviews in English only. As a result, the number of reviews decreased slightly: from 12,965 to 12,712 for Apple, and from 2,606 to 2,585 for Samsung.

Exploratory Data Analysis

Tokenization and Cleaning

Now that all our text is in English, we can start processing the dataset into a corpus. The next step is to create tokens - each token corresponds to a word - and to remove non-conforming content. We therefore remove from the corpus any punctuation, symbols, numbers, and separators. To improve the analysis we also remove "stop words", i.e. filler words such as "a", "the", etc. that do not add value to the analysis. Finally, instead of using a stemming method, we proceed with lemmatization: a lexicon dictionary looks up the root of each word in order to remove needless variation with minor changes - teach / teaching / taught are all reduced to the root teach. Before moving on to the graphical representation, we also compute the following (a minimal sketch of this pipeline follows the list below):

  • DTM - the Document-Term Matrix counts how many times a specific term appears in each document.
  • TF-IDF - Term Frequency-Inverse Document Frequency is a measure that quantifies the relevance of a particular word in a document.
  • Global frequency - the frequency of each word in each document, together with its rank.
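As a minimal sketch of this pipeline - assuming the corpus object is called corpus_reviews and that quanteda, quanteda.textstats, and the lexicon package are available - the cleaning and the DTM/TF-IDF computations could look as follows:

# # Tokenization, cleaning, and lemmatization (sketch with assumed object names)
library(quanteda)

tokens_clean <- corpus_reviews %>% 
  tokens(remove_punct = TRUE, remove_symbols = TRUE,
         remove_numbers = TRUE, remove_separators = TRUE) %>% 
  tokens_remove(stopwords("en")) %>% 
  tokens_replace(pattern = lexicon::hash_lemmas$token,
                 replacement = lexicon::hash_lemmas$lemma)

# Document-term matrix (DTM/DFM) and its TF-IDF weighted version
dfm_reviews <- dfm(tokens_clean)
tfidf_reviews <- dfm_tfidf(dfm_reviews)

# Global frequency (with rank) of each term
freq_reviews <- quanteda.textstats::textstat_frequency(dfm_reviews)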

Below you will find two tables describing the corpus of smartphone reviews that we are going to use. The first presents the corpus summary (the Text column shows the model name with the numbering of each review across the dataset, so we can identify them), while the second is grouped by model.

Corpus Summary
Text Types Tokens Sentences Brand Model
iPhone 11 Pro Max_1 26 28 1 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_2 21 28 3 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_3 4 4 1 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_4 22 23 1 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_5 253 636 46 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_6 35 44 2 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_7 18 18 1 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_8 41 50 4 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_9 39 56 3 Apple iPhone 11 Pro Max
iPhone 11 Pro Max_10 15 16 2 Apple iPhone 11 Pro Max
Corpus Group by Model Summary
Text Types Tokens Sentences Brand Model
iPhone 11 7514 181200 9819 Apple iPhone 11
iPhone 11 Pro 6081 107050 5259 Apple iPhone 11 Pro
iPhone 11 Pro Max 5364 85350 4251 Apple iPhone 11 Pro Max
iPhone 12 3490 36055 1868 Apple iPhone 12
iPhone 12 mini 3257 31869 1855 Apple iPhone 12 mini
iPhone 12 Pro 1999 12841 717 Apple iPhone 12 Pro
iPhone 12 Pro Max 990 3806 193 Apple iPhone 12 Pro Max
iPhone 13 1518 7307 397 Apple iPhone 13
iPhone 13 mini 1244 6017 354 Apple iPhone 13 mini
iPhone 13 Pro 656 2096 121 Apple iPhone 13 Pro
iPhone 13 Pro Max 706 2244 107 Apple iPhone 13 Pro Max
Samsung Galaxy 21 Ultra 481 1233 71 Samsung Samsung Galaxy 21 Ultra
Samsung Galaxy 22 Ultra 1846 8822 504 Samsung Samsung Galaxy 22 Ultra
Samsung Galaxy Note 20 Ultra 673 1783 100 Samsung Samsung Galaxy Note 20 Ultra
Samsung Galaxy S20 FE 587 1613 98 Samsung Samsung Galaxy S20 FE
Samsung Galaxy S20 Plus 1264 5095 295 Samsung Samsung Galaxy S20 Plus
Samsung Galaxy S21 FE 7378 111961 6460 Samsung Samsung Galaxy S21 FE
Samsung Galaxy S21 Plus 2806 18504 1089 Samsung Samsung Galaxy S21 Plus
Samsung Galaxy S22 947 3355 186 Samsung Samsung Galaxy S22
Samsung Galaxy S22 Plus 1017 3818 210 Samsung Samsung Galaxy S22 Plus


Plotting frequency

Observing the frequency plot, we can see that the most common word is "phone", followed by "battery" and "screen". While "phone" comes as no surprise, since our data consists of phone reviews on Amazon, the next two words are more interesting. This graph suggests that the hottest topics for consumers are the battery life of a new phone and the quality of its screen, rather than the software behind it or the added features.
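A sketch of how such a frequency plot can be produced from the cleaned DFM (dfm_reviews is an assumed object name):

# # Top-20 terms by frequency (sketch)
library(quanteda.textstats)
library(ggplot2)

top_terms <- textstat_frequency(dfm_reviews, n = 20)

ggplot(top_terms, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Most frequent terms in the reviews")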

Models Frequency based on DFM - Ungrouped

This graph focuses on the frequency of the 5 most common words in each of the documents. An interesting point is that, although more than 80% of the reviews in our dataset come from Apple buyers, most of the words in this sample relate to Android rather than iOS. Another interesting fact is that the most common words used throughout the corpus are samsung, s9, s20, phone, oneplus, entry, device, camera, and apple.

Models Frequency based on DFM - Grouped by Model

Unsurprisingly, the models selected by the DFM in the top part of the chart are the ones with the highest number of tokens. The common ground across models is that customers talk about the phone itself and, in some cases, about the seller. For example, we found that some iPhone 11 reviews were about the condition of the smartphone: the phones were refurbished, and users praised or complained about what they had received. Some of the top 5 terms used were screen, scratch, phone, iphone, buy, and battery, which concurs with the previous finding. Furthermore, the term phone is used across all models, while scratch is mostly used for the iPhone 11.

Models Frequency based on TF-IDF - Grouped by Model

The bottom part of the chart shows the same models as the top, but the difference lies in the relevance of the tokens with respect to the review. This implies that the terms shown capture the main context of the reviews. Thus, the iPhone 11 reviews are mainly characterized by scratch, whereas the Samsung Galaxy S21 FE reviews are all about comparison with the previous model (the Samsung Galaxy S20 FE). We also detected that the iPhone 11 Pro reviews follow the same pattern as the iPhone 11, the main difference being the terms aesthetic and generic.

Plotting maximum TF-IDF per document

Due to the large number of reviews, we decided to visualize the maximum TF-IDF for each document instead of showing every one of them. The representation therefore tells us that each of these words has a high TF-IDF in at least one document. Analyzing the output, entry seems to be the most relevant word, followed by "oneplus", "s9", and "s20". While the words ranked 2 to 4 are all specific to a model, it turns out that "entry" matters most to consumers; in the context of a purchase, entry might refer to the entry-level price.

We applied the same procedure as in the chart above, but this time grouping the documents by model so we could identify which terms have the largest TF-IDF for each smartphone. From the chart, we can interpret the following:

  • For the Mini versions of the iPhone 12 and 13, the term with the largest weighted frequency is mini.
  • For models such as the iPhone 11, iPhone 11 Pro, iPhone 11 Pro Max, iPhone 12, and iPhone 12 Pro, the terms iphone and scratch are highly relevant words in users' reviews.
  • In the last row, the most important words relate to the model of the smartphone itself: for example, Samsung Galaxy S21 Plus has the word "S21" at the top of its terms, and the same holds for Samsung Galaxy S22, Samsung Galaxy S22 Plus, Samsung Galaxy 20, and Samsung Galaxy 22 Ultra. One interesting point is that the second most significant term for the Samsung Galaxy S21 FE is S20, meaning that many users were comparing the new S21 generation to the older S20.
  • The models iPhone 13 Pro, iPhone 13 Pro Max, Samsung Galaxy Note 20 Ultra, and Samsung Galaxy S20 FE each have a term that is specific to them (arise, max, ejection, and sm-g780, respectively).

Document log frequency

While this log-frequency representation is quite cluttered due to the number of documents, we can still distinguish the tokens found in the frequency plot. Indeed, phone, battery, and screen are the most common words and are present in nearly every document. On the other hand, entry has a lower document frequency, meaning it appears in fewer documents, while still maintaining a decent log frequency, implying that it is specific to some documents only.

Words Cloud

Another representation of the document-frequency matrix is a word cloud, where the most relevant words appear in the largest size. From this cloud, we can identify phone, battery, screen, iPhone, scratch, and condition as the most used words.
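A word cloud like this one can be drawn directly from the DFM, for example (assuming dfm_reviews as in the earlier sketch):

# # Word cloud of the most frequent terms (sketch)
library(quanteda.textplots)
textplot_wordcloud(dfm_reviews, max_words = 100)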

Lexical diversity

We now want to take a look at the diversity of the words used. This is an interesting approach, as it allows us to see whether the reviews are rich in vocabulary or, instead, repetitive. Due to the number of documents, we only show a sample. A limitation of this representation is that lexical diversity depends on the length of the text, in this case the reviews. Nonetheless, it gives us insight into how diverse the reviews are in terms of the words used: the higher the TTR (type-token ratio), the more lexically diverse the text.
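A sketch of the TTR computation on the model-grouped DFM (dfm_models is an assumed object name):

# # Lexical diversity (type-token ratio) per model (sketch)
library(quanteda.textstats)

lexdiv_models <- textstat_lexdiv(dfm_models, measure = "TTR")
head(lexdiv_models)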

For instance, the chart shows that the Samsung Galaxy 21 Ultra has a higher lexical diversity than the other models. Looking at the bigger picture, the models with the highest lexical diversity are from Samsung, followed only by the Pro versions of Apple's thirteenth generation. We also find that the model with the highest number of tokens, the iPhone 11, is the one with the lowest lexical diversity.

Keyness

To continue our analysis, we applied a chi-square test of independence between Apple's and Samsung's reviews. The purpose of this test is to compare the terms of one set of documents against another; in this case we want to contrast the terms used by one consumer base with those used by the other. We plot the keyness results and, for reference, add two tables containing the 10 highest chi-square values with Samsung and with Apple as the target.
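A sketch of this keyness computation, assuming dfm_brand is a DFM grouped by Brand:

# # Chi-square keyness with Samsung and Apple as targets (sketch)
library(quanteda.textstats)
library(quanteda.textplots)

keyness_samsung <- textstat_keyness(dfm_brand, target = "Samsung", measure = "chi2")
keyness_apple   <- textstat_keyness(dfm_brand, target = "Apple", measure = "chi2")

# Top 10 key terms for each target, and the keyness plot
head(keyness_samsung, 10)
head(keyness_apple, 10)
textplot_keyness(keyness_samsung)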

From the graph, we can see that in both sets of reviews the most common word is the respective brand name. The main difference lies in the remaining distinctive words. For Samsung (with Apple as reference), most of the distinctive terms relate to other models; when Apple is the target (with Samsung as reference), the most distinctive terms are condition and scratches.

Our intuition is that, for Samsung, this pattern could be due to the number of products the brand offers, giving reviewers a larger set of comparisons when reviewing:

  • Galaxy-S
  • Galaxy-A
  • Galaxy-Z
  • Galaxy-Foldables
  • Galaxy-Notes

as for Apple:

  • iPhone
  • iPhone mini
  • iPhone Pro
  • iPhone SE
Key terms of Apple vs Samsung

Chi-square table - Samsung as Target
Samsung as Target
feature chi2 p n_target n_reference
samsung 2326 0 886 47
s20 684 0 241 1
5g 667 0 283 33
galaxy 624 0 254 23
fe 563 0 197 0
fingerprint 475 0 240 54
s21 446 0 169 8
s9 318 0 113 1
reader 316 0 161 37
sd 265 0 96 2
Chi-square table - Apple as Target
Apple as Target
feature chi2 p n_target n_reference
iphone 694 0 2478 84
condition 578 0 2032 64
scratches 571 0 1747 20
apple 241 0 1026 58
renewed 232 0 724 10
battery 229 0 4000 789
perfect 210 0 1180 107
arrived 197 0 995 77
seller 193 0 810 44
product 191 0 1278 141

Sentiment analysis

This section aims at attributing a sentiment score to each review. We decided to work with the nrc and afinn dictionaries from the tidytext package as well as with the sentimentr package. This approach offers more granular sentiment scoring than the tidyverse and quanteda packages, which only provide positive, negative, or negative-positive classification.

We used all three datasets for this analysis, namely smartphone_reviews_final.csv, Apple_final.csv, and Samsung_final.csv. To prepare the data, we tokenized the reviews by lowercasing them and removing punctuation and numbers. Below is a glimpse of the tokenized version of the reviews from the smartphone_reviews_final.csv dataset, which includes all the reviews (Apple and Samsung).

# # Read data
all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))
apple <- read.csv(here::here("data/Apple_final.csv"))
samsung <- read.csv(here::here("data/Samsung_final.csv"))

# # Unnest tokens
all_reviews_token <- all_reviews %>% 
  mutate(review_id = seq(1:nrow(all_reviews))) %>% 
  relocate(review_id, .before = "Brand") %>% 
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
  
apple_token <- apple %>% 
  mutate(review_id = seq(1:nrow(apple))) %>% 
  relocate(review_id, .before = "Brand") %>% 
  unnest_tokens(output = "word",
                 input = "Reviews",
                 to_lower = TRUE,
                 strip_punct = TRUE,
                 strip_numeric = TRUE)

samsung_token <- samsung %>% 
  mutate(review_id = seq(1:nrow(samsung))) %>% 
  relocate(review_id, .before = "Brand") %>% 
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE)
#> 'data.frame':    553049 obs. of  4 variables:
#>  $ review_id: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Brand    : chr  "Apple" "Apple" "Apple" "Apple" ...
#>  $ Model    : chr  "iPhone 11 Pro Max" "iPhone 11 Pro Max" "iPhone"..
#>  $ word     : chr  "i" "received" "the" "iphone" ...

Apple vs. Samsung

Using the tokenized version of the reviews, we first carried out a sentiment analysis using the nrc dictionary from the tidytext package to compare overall sentiment scores between Apple and Samsung reviews. We created our own function, called sentiment_function, in order to return the data in the desired format. See the function code and its application below.

# # This function runs a sentiment analysis using the 'nrc' dictionary
sentiment_function <- function(token_data, id_column, sentiment_column){
  sentiment_data <- inner_join(token_data, get_sentiments("nrc"),
                                by = c("word" = "word"))
  
  sentiment_matrix <- table(sentiment_data[[id_column]],
                                  sentiment_data[[sentiment_column]])
  return(sentiment_matrix)
}

# # Get sentiment analysis: Apple vs. Samsung
# Sentiment based using "nrc" dictionary
for (i in list(apple_token, samsung_token)){
  # Perform the sentiment analysis
  sentiment_analysis <- sentiment_function(i, "review_id", "sentiment")
  
  # Print the plot
  sentiment_plot <- ggplot(tibble(sentiment = names(colSums(sentiment_analysis)),
                sum_value = colSums(sentiment_analysis)),
         aes(x = reorder(sentiment, -sum_value), y = sum_value)) +
    geom_col() +
    ggtitle(i$Brand[1]) +
    ylab("Number of tokens") +
    xlab("Sentiment") +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=0.5))
  
  # Assign graph
  assign(paste0("sentiment_", i$Brand[1]), sentiment_plot)
}

This graph shows some signals that Apple reviews contain more sentiments linked to trust and joy than Samsung's. Apple phones also seem to trigger relatively less sadness, fear, and disgust than their counterpart.

Secondly, we carried out a sentiment analysis using the afinn dictionary and the sentimentr package. Again, we show the function we created to return the results in the desired format, together with an example of its use.

# This function runs 2 sentiment analysis using 'afinn' dictionary and sentimentr package
# Note that if the parameters for one of the two (or the two) sentiment analysis
# are empty, it will only run the one for which it has parameters or return NULL

value_function <- function(data_text=NULL, data_token=NULL, id_column=NULL, value_column=NULL){
  data_value_afinn <- NULL
  data_value_sentimentr <- NULL
  
  if (!is.null(data_token) | !is.null(data_text)){
    if (!is.null(data_token)){
      # Get "afinn" sentiment value
      data_value_afinn <- inner_join(data_token, get_sentiments("afinn"),
                                  by = c("word" = "word")) %>%
        group_by(review_id = {{id_column}}) %>% 
        summarise(afinn_value = mean({{value_column}})) %>% 
        melt(id.vars = "review_id")
    }
    
    if (!is.null(data_text)){
      # Get "sentimentr" sentiment value
      data_value_sentimentr <- get_sentences(data_text) %>% 
        sentiment() %>% 
        group_by(review_id = element_id) %>% 
        summarise(sentimentr_value = mean(sentiment)) %>% 
        melt(id.vars = "review_id")
    }
    
    if (!is.null(data_value_afinn) & !is.null(data_value_sentimentr)){
      # Join the data into one table
      data_value <- rbind(data_value_afinn, data_value_sentimentr)
      return(data_value)
      
    } else if (!is.null(data_value_afinn)){
      return(data_value_afinn)
      
    } else if (!is.null(data_value_sentimentr)){
      return(data_value_sentimentr)
      
    } else {
      return(NULL)
    }
  }
}

# Value based using "afinn" dictionary and "sentimentr"
# Valence shifter: https://www.r-bloggers.com/2020/04/sentiment-analysis-in-r-with-sentimentr-that-handles-negation-valence-shifters/
index <- 1
data_list <- list(apple, samsung)
for (i in list(apple_token, samsung_token)){
  # Perform the sentiment value analysis
  value_analysis <- value_function(data_list[[index]]$Reviews, i, review_id, value)
  
  # Print the plot
  value_plot <- ggplot(value_analysis, aes(y = value, fill = variable)) +
      geom_boxplot() +
      ylab("Average sentiment value") +
      ggtitle(i$Brand[1]) +
      scale_y_continuous(breaks = seq(0,5,0.2), limits = c(0,5)) +
      guides(fill=guide_legend("Sentiment method")) +
      theme(axis.text.x = element_blank(),
            axis.ticks.x = element_blank())
  
  # Assign graph
  assign(paste0("value_", i$Brand[1]), value_plot)
  
  # Increment index
  index <- index + 1
}

It is interesting to see that according to the afinn dictionary, both brands have the exact same sentiment distribution. Regarding the sentimentr package, it looks like the sentiment scores are a tiny bit higher for Apple, but the difference is negligible.

Model comparison {.tabset}

‘nrc’ dictionary

‘afinn’ dictionary

‘sentimentr’ package

While there is a lot to say on these graphs, we will list the most interesting outcomes:

  1. The iPhone 11 Pro Max, iPhone 13 Pro and the Samsung Galaxy S21 Plus seem to be the most appreciated phones. Note that the iPhone 13 Pro has a particularly high level of trust according to the nrc dictionary.
  2. Among Apple phones and based on the nrc dictionary, it seems that the iPhone 12 mini is the model that is most associated with negative sentiments.
  3. Overall, the Samsung Galaxy Note 20 Ultra, Samsung Galaxy S22 and Samsung Galaxy S22 Plus seem to be the least appreciated phones as they have a high count of negative tokens and relatively low sentiment scores (afinn and sentimentr).

Apple models: Base vs. Pro vs. Pro Max

Based on the results of the sentimentr package, the Pro Max models seem to have slightly higher sentiment scores than the Pro and base ones. However, the difference appears to be negligible.

Unsupervised Learning

Similarities

For the unsupervised learning analysis, we analyze the similarities between the models and the similarities between the words present in the reviews. Among the distance measures we considered, we use Euclidean distance as our distance/similarity measure for this project, as it captures the shortest (straight-line) distance between objects.

Reading the Euclidean distances directly from the matrix is cumbersome because of the number of models. Therefore, to highlight similarities more easily, we create a heatmap representation of the similarities between the reviews.
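A sketch of the distance computation and heatmap, assuming dfm_models is the DFM grouped by Model:

# # Euclidean distances between model-grouped reviews (sketch)
library(quanteda.textstats)

dist_models <- textstat_dist(dfm_models, method = "euclidean")
dist_matrix <- as.matrix(dist_models)

# Base-R heatmap of the pairwise distances between models
heatmap(dist_matrix, symm = TRUE)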

Looking at the heatmap of the grouped reviews below, we can conclude the following points:

  • The iPhone 11 is not similar to any other phone. The closest one to it is the iPhone 11 Pro, at a Euclidean distance of around 0.5.
  • The Samsung Galaxy S21 FE is not similar to any other phone.
  • The Samsung Galaxy S22 and S22 Plus are similar to almost all the remaining phone models, except of course the iPhone 11 and 11 Pro.

Further, we studied the graph that plots the Euclidean distances through the thickness of the lines connecting the models: the thicker the line, the more dissimilar (the further apart) the models are. In line with the heatmap above, this graph confirms our previous finding that the Samsung Galaxy S21 FE and the iPhone 11 are the models most dissimilar from all the others.

Clustering of Documents

Moving on to the clustering of documents, we created dendrograms showing which models are similar to each other and at which stage they are clustered together. One of the first clusters formed is between the iPhone 13 Pro and the Samsung Galaxy 21 Ultra. Furthermore, as shown, we can choose to keep four (4) clusters. Comparing the dendrogram with our previous results, we can also confirm how dissimilar the iPhone 11 and the Samsung Galaxy S21 FE are from the other models, as they are only added to a cluster in the last two merges.

For the clustering below, we chose the complete-linkage method, as it gives almost the same results as the average-linkage method and shows clearly distinct clusters.

Analyzing the results of k-means with 4 clusters, the ratio of the between-cluster sum of squares to the total sum of squares is around 90%, leaving a within-cluster sum of squares ratio of about 10%. These results look promising, as we want the between-cluster sum of squares to be high and the within-cluster sum of squares to be low.
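A sketch of both clustering steps, reusing the distance object and the grouped DFM from the previous section (assumed names dist_models and dfm_models):

# # Hierarchical clustering (complete linkage) and k-means with 4 clusters (sketch)
hc_models <- hclust(as.dist(dist_models), method = "complete")
plot(hc_models)

set.seed(123)
km_models <- kmeans(as.matrix(dfm_models), centers = 4, nstart = 20)
km_models$betweenss / km_models$totss  # between_SS / total_SS ratio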

Extracting the ten most used words

We then created the 4 clusters and extracted the 10 words used most often in each cluster to get an insight into what each cluster talks about.

#>      Clust.1   Clust.2   Clust.3    Clust.4
#> 1     iphone   scratch    iphone         fe
#> 2    scratch    iphone   scratch        s20
#> 3     health aesthetic      mini    samsung
#> 4      renew   generic       s21         s8
#> 5  refurbish      mica    health        ram
#> 6     damage     renew     renew      entry
#> 7    speaker       max       s22        s21
#> 8   daughter   buyspry   samsung snapdragon
#> 9   transfer    health refurbish   flagship
#> 10   generic       pro     dirty         s9
#>      Clust.1   Clust.2    Clust.3   Clust.4
#> 1    scratch    iphone         fe    iphone
#> 2     iphone   scratch        s20   scratch
#> 3  aesthetic    health    samsung      mini
#> 4    generic     renew         s8       s21
#> 5       mica refurbish        ram    health
#> 6      renew    damage      entry     renew
#> 7        max   speaker        s21       s22
#> 8    buyspry  daughter snapdragon   samsung
#> 9     health  transfer   flagship refurbish
#> 10       pro   generic         s9     dirty

Similarities between Words

Next, we analyze the similarities between words, again using a heatmap. Surprisingly, the heatmap below does not show strong similarities between any of the words. The closest pair is "life" and "saver", with a cosine similarity of around 0.55, which makes sense since those two (2) words often appear together.

The heatmap below displays the pairwise similarities for all of the words used in the reviews. The word "samsung" is the most dissimilar from the others. Looking in detail at the similarities of the word "samsung", the most similar words to it are "camera" and "fast". Therefore, we can say that, among all the features, what people commented on most for Samsung phones are their speed and camera.


On the other hand, if we look at the similarity matrix for the iPhone, we can see that it is similar to almost all the features mentioned, which means that consumers mentioned those features fairly evenly in their reviews. However, it is least similar to the words "camera", "fast", and "card", meaning that these words were mentioned less often in iPhone reviews relative to the other words.

#>  [1] "phone"     "battery"   "screen"    "buy"       "scratch"  
#>  [6] "iphone"    "condition" "life"      "purchase"  "love"     
#> [11] "day"       "product"   "camera"    "time"      "perfect"  
#> [16] "arrive"    "return"    "excellent" "price"     "charger"  
#> [21] "charge"    "recommend" "apple"     "brand"     "amazon"   
#> [26] "happy"     "issue"     "fast"      "samsung"   "seller"   
#> [31] "unlock"    "box"       "month"     "receive"   "protector"
#> [36] "quality"   "renew"     "bad"       "expect"    "sim"      
#> [41] "refurbish" "money"     "call"      "card"      "device"   
#> [46] "fine"      "review"    "cell"      "original"  "worth"

Clustering Words

The dendrogram below shows how the words are clustered together. We can clearly see four (4) clusters. As seen previously in the heatmap, the word samsung is placed in a cluster by itself, as it is the word most dissimilar from the others.

Co-occurrence

Co-occurrence describes how words occur together, which in turn captures the relationships between words. From this we can see that the words that co-occur most often are "phone" and "battery", which is also clearly visible in the dendrogram.
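A sketch of the co-occurrence analysis, assuming tokens_clean holds the cleaned tokens from the earlier pipeline:

# # Feature co-occurrence matrix and network of the strongest co-occurrences (sketch)
library(quanteda)
library(quanteda.textplots)

fcm_reviews <- fcm(tokens_clean, context = "window", window = 5)

# Keep the 50 most frequent features and plot their co-occurrence network
top_feats <- names(topfeatures(fcm_reviews, 50))
textplot_network(fcm_select(fcm_reviews, pattern = top_feats), min_freq = 0.8)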

Topic Modeling

We decided to use topic modeling to discover the abstract "topics" that occur in a collection of documents, in this case the grouped corpus of smartphone models (Wikipedia, 2022). Topic modeling helps us identify the context of the documents by detecting similar word patterns inside them and by clustering those groups of words together.

LSA on Term Frequencies (DFM)

Latent Semantic Analysis, or LSA, is one of the techniques we use for topic modeling. LSA is a dimensionality-reduction technique that decomposes the DTM into three matrices (\(M = U \Sigma V^{t}\)), where \(\Sigma\) represents the strength of each topic, \(U\) the links between the documents and each topic, and \(V^{t}\) the links between the terms and each topic.
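A minimal sketch of this decomposition on the model-grouped DFM (assumed name dfm_models), keeping 5 dimensions:

# # LSA on the grouped DFM with quanteda.textmodels (sketch)
library(quanteda.textmodels)

lsa_dfm <- textmodel_lsa(dfm_models, nd = 5)

# Document loadings (U) and term loadings (V) on the 5 dimensions
head(lsa_dfm$docs)
head(lsa_dfm$features)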

First, we plotted the first dimension to check whether it is associated with document length, as this is known to happen with the first LSA dimension. Looking at the result in the tab called Dimension 1, we can confirm that this is the case, as Dimension 1 is negatively correlated with document length. Furthermore, the iPhone 11, the document with the highest number of tokens, sits in the bottom-right of the chart, while the Samsung Galaxy 21 Ultra, with the lowest number of tokens, is located in the top-left.

In the second tab, "Topics 2 and 3", we look at the top 5 words associated with topics 2 and 3 and the top 5 words negatively associated with them. In the table for Topic 2 we see that the words "scratch", "iphone", "condition", "battery", and "product" are the ones associated with this topic, while "5g", "s20", "camera", "phone", and "samsung" are negatively linked. We can say that Topic 2 can be identified with models from the Apple brand. On the other hand, in the table for Topic 3 we find that the main words related to this topic are pro, samsung, arrive, battery, and camera, while the negatively associated ones are brand, buy, iphone, unlock, and phone. This topic is likely to appear in reviews of Samsung models and some of Apple's Pro models.

For a visual representation of the words listed above for topics 2 and 3, we plot the corresponding dimensions (Dim 2 and 3) in a biplot. The tab "Biplot of Dim 2 and 3" confirms the points mentioned before, as we see the same words associated (and negatively associated) with Dim 2 (Topic 2) and Dim 3 (Topic 3). The Samsung Galaxy S21 FE, iPhone 11 Pro Max, and iPhone 11 Pro are associated with Topic 3, while the iPhone 11 is unconnected to this topic. For Topic 2, we can determine that the iPhone 11 Pro, iPhone 11 Pro Max, iPhone 11, and iPhone 12 are related to it.

Dimension 1
Topics 2 and 3
Topic 2
x
scratch 0.337
iphone 0.282
condition 0.269
battery 0.210
product 0.142
5g -0.120
s20 -0.124
camera -0.151
phone -0.349
samsung -0.373
Topic 3
x
pro 0.283
samsung 0.270
arrive 0.248
battery 0.217
camera 0.192
brand -0.110
buy -0.114
iphone -0.123
unlock -0.172
phone -0.232
Biplot of Dim 2 and 3

LSA on Term Frequencies (TF-IDF)

Now we apply the same approach as above, but using the TF-IDF matrix. As already explained, TF-IDF quantifies the relevance of a word in a document. First, we build the LSA object with the textmodel_lsa function from the quanteda.textmodels package on the TF-IDF matrix of the grouped model reviews (only 5 dimensions). Next, for interpretability, we keep only the 5 words with the highest values and the 5 with the lowest. Finally, we plot the biplot to identify the topics and the words associated with and unrelated to them.
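The only change with respect to the previous sketch is the TF-IDF weighting of the DFM before fitting (assumed names as above):

# # LSA on the TF-IDF weighted DFM (sketch)
tfidf_models <- quanteda::dfm_tfidf(dfm_models)
lsa_tfidf <- quanteda.textmodels::textmodel_lsa(tfidf_models, nd = 5)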

We find that Topic 2 (Dim 2) is associated with the words "scratch", "iphone", "generic", "health", and "renew" and with the models iPhone 11 Pro, iPhone 11, and iPhone 11 Pro Max, and that it is unrelated to terms such as "s21", "s8", "samsung", "s20", and "fe" and to the Samsung Galaxy S21 FE. For Topic 3 (Dim 3) we see a relation with terms such as "aesthetic", "mica", "generic", "pro", and "max" and with the iPhone 11 Pro and iPhone 11 Pro Max, while it is unrelated to the iPhone 11 and the words "transfer", "red", "yellow", "purple", and "daughter".

Comparing the two LSA approaches (DFM and TF-IDF), the words shared by Topic 2 are "scratch" and "iphone". This makes sense, as we have seen that the main concerns of iPhone reviewers are comparing their new model with previous models (hence the term iphone) and praising or complaining about the condition of the smartphone received (hence the term scratch). For Topic 3, the terms change significantly, meaning that the context of the topic would be different, with the exception of the word pro.

LDA

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm that is used to identify the topics present in a collection of documents. It is a generative model that assumes that each document is a mixture of a fixed number of topics, and that each word in the document is associated with one of the topics (Susan Li, 2018).

We decided to apply this algorithm to our dataset, so we build the LDA object with the textmodel_lda function from the seededlda package. We used the same number of topics (5) as in the LSA DTM approach; we also tried 10, 9, and 7 topics, but the results were not meaningful.
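A minimal sketch of this fit on the model-grouped DFM (assumed name dfm_models):

# # LDA with 5 topics using seededlda (sketch)
library(seededlda)

set.seed(123)
lda_models <- textmodel_lda(dfm_models, k = 5)

# Top 5 terms per topic
terms(lda_models, 5)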

Below we can observe the top 5 terms appearing in each topic.

Top 5 terms per Topic
topic1 topic2 topic3 topic4 topic5
mini phone phone pro samsung
red battery battery arrive 5g
size screen scratch max galaxy
128gb buy iphone aesthetic s20
se camera condition detail fingerprint
Term-Topic Analysis

\(\phi\) (phi) is the term-topic distribution. It represents the probability of a term occurring in a given topic: for a given topic, the \(\phi\) values indicate how likely each term is to be associated with that topic, and the terms with the highest \(\phi\) values are the ones most strongly associated with it. To visualize these associations, we plot the \(\phi\) values of the 10 terms with the largest probabilities within each topic.

So we can interpret the following:

  • Topic 1: The reviews associated with this topic talk more about the color, capacity, and size of the models.
  • Topic 2: It is mainly related to the term phone, so we expect this topic to appear across all models (see the Topic-Document Analysis).
  • Topic 3: It refers more to the characteristics and condition of the smartphones.
  • Topic 4: We can assume that this topic is associated with some of the iPhone models, as the terms "pro" and "max" are in the top 3, and perhaps with some Samsung models, due to the words "aesthetic" and "detail".
  • Topic 5: The top terms tell us that this topic is oriented towards the Samsung brand, as the main terms relate to Samsung models.

Topic-Document Analysis

The \(\theta\) (theta) matrix is the document-topic distribution. It represents the probability of a topic occurring in a given document: for a given document, the \(\theta\) values indicate how likely each topic is to be present in that document.
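Both distributions can be read directly from the fitted LDA object (assumed name lda_models from the earlier sketch):

# # Term-topic (phi) and document-topic (theta) distributions (sketch)
phi <- lda_models$phi      # topics x terms: term probabilities per topic
theta <- lda_models$theta  # documents x topics: topic proportions per document

# Topic proportions for each grouped document (model)
round(theta, 2)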

Looking at the below chart, we can identify the following:

  • The model reviews in the first row - the iPhone 13 mini, iPhone 13 Pro, and iPhone 13 Pro Max - are roughly 50% about Topic 3.
  • Unsurprisingly, all models devote a significant share to Topic 2.
  • In the Samsung models, Topic 4 makes only a minor appearance, whereas the Apple models show a higher share of it.
  • Topic 5 is mainly observed in the Samsung models, confirming what we inferred in the previous analysis.
  • Topic 1 is slightly prevalent in Apple's mini models, the iPhone 13, and the Samsung S20 Plus and FE.

LDA Diagnostics across Topics - Prevalence, Coherence, and Exclusivity

There are several ways to evaluate the quality of a topic model; in this case, we evaluate the LDA using prevalence, coherence, and exclusivity.

Prevalence is a measure of how frequently a topic appears in the documents. A topic with a high prevalence is likely to be important and relevant to the overall collection of documents, while a topic with a low prevalence may not be as important or relevant.

Coherence is a measure of how well the words within a topic are related to each other. A topic with high coherence is likely to be more interpretable and easier to understand, while a topic with low coherence may be more difficult to interpret.

Exclusivity is a measure of how unique a topic is compared to the other topics in the model. A topic with high exclusivity is likely to be more distinct and easily separable from other topics, while a topic with low exclusivity may overlap with other topics and be harder to distinguish.

In the top left of the chart below, we can observe that the most prevalent topic across the models is Topic 2; this makes sense, as the previous chart showed that this topic was the most common. We also see that Topic 2 is the most coherent, whereas the least coherent is Topic 1. However, Topic 1 is the most exclusive, because its five terms are the most specific to it.

Word Embedding

The word embedding has been applied to our three review datasets: smartphone_reviews_final.csv, Apple_final.csv, and Samsung_final.csv. To embed words into 50-dimension vectors, we applied the word2vec::word2vec function using the cbow method with 30 iterations.

# # Read data
all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))

# # Train word2vec model
# Embed each word into a 50-dimension vector
all_model <- word2vec(tolower(all_reviews$Reviews), type = "cbow", dim = 50, iter = 30)

# Transform the result into a matrix
embedded_words <- as.matrix(all_model)

Here is an overview of the resulting matrix:

#>  num [1:3762, 1:50] -1.58 -0.274 0.689 2.108 0.41 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ word     : chr [1:3762] "doubtful" "great" "them" "serve" ...
#>   ..$ dimension: chr [1:50] "V1" "V2" "V3" "V4" ...

To visualize the results interactively, we used uwot::umap and the plotly package. In the following 2D and 3D graphs, only a subset of the words has been plotted to keep the visualization readable.

# # Read models
all_model <- read.word2vec(here::here("script/models/word2vec/all_model.bin"))

# # Create interactive plot of words
# source: https://cran.r-project.org/web/packages/word2vec/readme/README.html
embedded_words <- as.matrix(all_model)
viz <- umap(embedded_words, n_neighbors = 25, n_threads = 5, n_components = 3)

# Create the dataframe used for the plot
df  <- data.frame(word = gsub("//.+", "", rownames(embedded_words)), 
                  xpos = gsub(".+//", "", rownames(embedded_words)), 
                  x = viz[, 1], y = viz[, 2], z = viz[, 3],
                  stringsAsFactors = FALSE)

# Subset the dataframe
set.seed(456)
nb_words_to_display <- 1500
df  <- df[sample(1:nrow(df), nb_words_to_display),]

# Interactive 2D plot
graph_2d <- plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "text", text = ~word) %>% 
  layout(title = "2-Dimension word embedding (interactive graph)")

# Interactive 3D plot
graph_3d <- plot_ly(df, x = ~x, y = ~y, z = ~z, type = "scatter3d", mode = 'text', text = ~word) %>% 
  layout(title = "3-Dimension word embedding (interactive graph)")

Document embedding

We pre-processed the documents (reviews) by removing punctuation, lowercasing and splitting them into individual words. We then used the results from the word embedding to embed each review using the following function:

# This function takes 2 arguments: a word vector (= sentence/document) and a matrix of embedded words
# It uses the values of each embedded word to compute the value of the word vector
# Here the input word vector should be the words in a given document
# It assumes that by averaging the vector values of the words found in the
# document it is possible to summarize the information contained in a document
# as a vector

document_embedding <- function(words, embedded_words_matrix){
  # Keep words present in the embedded_words_matrix
  look_up_words <- words[words %in% rownames(embedded_words_matrix)]
  
  # Document embedding
  # If there is more than one word in the document, average their vectors
  if (length(look_up_words) > 1){
    document_embedding <- colMeans(embedded_words_matrix[look_up_words,])
    # If length == 1, don't run colMeans as it would throw an error
  } else if (length(look_up_words) == 1){
    document_embedding <- embedded_words_matrix[look_up_words,]
    # If look_up_words is empty, return a vector of zeros
  } else {
    document_embedding <- rep(0, ncol(embedded_words_matrix))
  }
  
  # Return embedded document
  return(document_embedding)
}

# # Embed documents (reviews) using document_embedding function
# Remove punctuation
all_reviews_sentences <- gsub('[[:punct:] ]+',' ', all_reviews$Reviews)

# Split sentences into words (to lower and trimmed) - Resulting in a list
document_words <- lapply(strsplit(tolower(all_reviews_sentences), " "), trimws)

# Document embedding using own document_embedding function
embedded_documents <- lapply(X = document_words, FUN = document_embedding, embedded_words_matrix = embedded_words)

Here is an overview of the resulting matrix:

#>  num [1:15297, 1:50] -0.293 -0.629 -0.319 -0.627 -0.278 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ review   : chr [1:15297] "1" "2" "3" "4" ...
#>   ..$ dimension: chr [1:50] "V1" "V2" "V3" "V4" ...

Clustering of documents

In this section, we are using our previously created embedded document matrix in order to cluster documents together. In theory, the clustering should generate groups of similar documents. This is useful to understand the main topics and themes discussed by customers in their reviews.

For the clustering approach, we decided to create a total of 7 groups with nstart=20. The graph below includes a subset of 500 reviews showing the results.

# # Read all reviews
all_reviews <- read.csv(here::here("data/smartphone_reviews_final.csv"))

# # Read embedded document matrix
embedded_documents <- select(read.csv(here::here("script/unsupervised_learning/embedding/embedding_data/embedded_documents.csv")), -X)

# # Clustering of documents
set.seed(650)
number_of_centers <- 7
all_reviews_clustering <- kmeans(embedded_documents, centers = number_of_centers, nstart = 20)

# Store size of clusters
cluster_size <- tibble(cluster = seq(1:number_of_centers),
                       number_of_reviews = all_reviews_clustering$size)

In order to capture the topics discussed in each cluster, we decided to:

  • Tokenize the reviews (lowercase, remove punctuation and numbers)
  • Remove stop words
  • Group by cluster and count word occurrence and frequency
  • Keep words with occurrences higher than 3
  • Keep the top 15 (max) per cluster
  • Graph the results
# Helper used below (negation of %in%), in case it is not defined in the setup
`%notin%` <- Negate(`%in%`)

# Create dataset for plotting: Get top words per cluster
cluster_topics <- all_reviews %>% 
  mutate(review_id = seq(1:nrow(all_reviews)),
         # Attach the k-means cluster of each review (same row order as embedded_documents)
         cluster = all_reviews_clustering$cluster) %>% 
  relocate(review_id, .before = "Brand") %>% 
  unnest_tokens(output = "word",
                input = "Reviews",
                to_lower = TRUE,
                strip_punct = TRUE,
                strip_numeric = TRUE) %>% 
  filter(word %notin% stop_words$word) %>% 
  group_by(cluster, word) %>% 
  tally() %>% 
  mutate(freq = n/sum(n)) %>% 
  filter(n >= 3) %>% 
  arrange(cluster, desc(freq)) %>% 
  top_n(15, freq) %>% 
  left_join(cluster_size, by = c("cluster" = "cluster")) %>% 
  mutate(cluster_header = paste0("Cluster ", cluster, "\nn_reviews = ", number_of_reviews, ", n_words = ", round(n/freq), "\nav_length = ", round(round(n/freq)/number_of_reviews, 2)))

Thanks to the results of the clustering approach, we are able to observe several patterns in the reviews.

  1. Cluster size is not uniformly distributed
  2. The average length of reviews varies from one cluster to another (min = 0.43, max = 19.87)
  3. Cluster 1 seems to group longer reviews and general comments about the battery, screen, place of purchase and return conditions. This cluster might include all the reviews where customers describe their overall experience in order to inform future purchasers.
  4. Cluster 4 is similar to cluster 1 but is slightly more oriented towards the phones’ conditions after a certain usage period (scratch, condition, charger, protector).
  5. Cluster 5 has slightly shorter reviews and seems to have a focus on battery and battery life. Further investigation would be needed but it could be that the overall sentiment of that group of reviews is positive since it also includes several occurrences of words such as excellent, perfect and perfectly.
  6. Cluster 3 seems to group reviews of happy customers who are willing to share their experiences ('promoters').
  7. Clusters 2 and 6 appear to contain comments from happy customers who wanted to express their satisfaction in a very concise manner (short reviews).
  8. Finally, cluster 7 seems to be grouping bad comments about speakers’ quality.

As we can see, the clustering managed to surface groups of positive comments. However, it is quite surprising that the only cluster of negative comments contains just 203 reviews and mostly concerns the quality of media playback and the speakers. It would be worth analyzing the large clusters that do not surface particular signs of satisfaction (1, 4, 5) to better understand their content. Possible approaches include increasing the number of clusters or clustering the clusters of interest.

Supervised learning: Predicting with random forest

This part of the report attempts to use the results of the sentiment analysis and document embedding to train a random forest algorithm to predict reviews’ sentiment. The following dataset (15’297 x 53) has been used to train the algorithm (only the first and last dimension of the document embedded matrix are displayed):

# # Read sentiment score per review
sentimentr_documents <- read.csv(here::here("script/sentiment_analysis/sentiment_data/sentimentr_per_review.csv")) %>% 
  select(review_id, brand = Brand, model = Model, sentimentr_score = sentimentr_value)

# # Read document embedded matrix
embedded_documents <- select(read.csv(here::here("script/unsupervised_learning/embedding/embedding_data/embedded_documents.csv")), -X) %>%
  mutate(review_id = as.integer(rownames(.))) %>% 
  relocate(review_id, .before = V1)

# Join sentiment and embedding matrix to form the dataset
dataset <- left_join(embedded_documents, sentimentr_documents,  by = "review_id") %>% 
  select(-review_id)
dataset$brand <- as.factor(dataset$brand)
dataset$model <- as.factor(dataset$model)
#> 'data.frame':    15297 obs. of  5 variables:
#>  $ V1              : num  -0.293 -0.629 -0.319 -0.627 -0.278 ...
#>  $ V50             : num  0.328 0.188 0.81 0.395 0.113 ...
#>  $ brand           : Factor w/ 2 levels "Apple","Samsung": 1 1 1 1 ..
#>  $ model           : Factor w/ 20 levels "iPhone 11","iPhone 11 Pr"..
#>  $ sentimentr_score: num  0.2157 0.0709 -0.075 0.5162 0.26 ...

The following code has been used to train the random forest (100, 500 and 1000 trees):

# # Training and testing dataset
# Split the data
set.seed(456)
split <- sample.split(dataset, SplitRatio = 0.75)
train <- subset(dataset, split == TRUE)
test <- subset(dataset, split == FALSE)

# # Train the model: Random forest regressor
nb_trees <- 100 # 500, 1000
fit <- randomForest(data = train,
                    sentimentr_score ~ .,
                    ntree = nb_trees,
                    mtry = 25,
                    importance=TRUE)

# Save the model
save(fit, file = paste0(here::here("script/models/randomForest/rf_", nb_trees, ".RData")))

The accuracy of the model has been measured on the test set using the caret::RMSE function, and the results have been plotted in order to visualize prediction accuracy. The plot below contains a subset of 2,500 reviews out of the 4,040 in the test set:

# # Evaluate model accuracy on test set
predictions <- predict(fit, test[, 1:(ncol(test) - 1)])
accuracy <- RMSE(predictions, test[["sentimentr_score"]])

# Data frame including actual values and predictions
results <- tibble(actual = test[["sentimentr_score"]],
                  predictions = predictions) %>% 
  arrange(actual)
results["paired"] <- 1:nrow(results)
results <- melt(results, id.vars = "paired")

# Data to be plotted
set.seed(500)
plot_data <- filter(results, paired %in% sample(1:nrow(results), 5000))
plot_actual <- filter(plot_data, variable == "actual")

This graph shows that the random forest model performs rather poorly overall. It does particularly badly when predicting extreme sentiment values; it appears to predict between -0.1 and 0.5 more or less at random, regardless of the input. One of the main explanations is that the training is based on approximate labelling of the data: the sentiment score comes from the sentimentr package, which is itself limited in accuracy. Moreover, the document embedding relies on the assumption that a review can be summarised as a 50-dimension vector obtained by averaging the 50-dimension vectors of the words it contains. To improve accuracy, it might be worth increasing the number of dimensions of those vectors as well as the size of the training set. Some improvements could also be made to the sentiment score attribution; a natural language processing deep learning model might capture the sentiment of the reviews better.

Conclusion

References

Topic model. (2022, September 27). In Wikipedia.

Susan Li. (2018, May 31). Towards Data Science.

RS, A.M.R. (2020). Sentiment analysis in R with {sentimentr} that handles negation (valence shifters). R-bloggers. Available at: https://www.r-bloggers.com/2020/04/sentiment-analysis-in-r-with-sentimentr-that-handles-negation-valence-shifters/ (Accessed: December 18, 2022).

Word2vec (no date). README. Available at: https://cran.r-project.org/web/packages/word2vec/readme/README.html (Accessed: December 18, 2022).